1.1. Executive Summary

This research program focused on the development of optoelectronic interconnection networks 
that combine communication and processing capabilities in network hardware to accelerate 
distributed computing applications. 

On the architecture front, we have designed an optoelectronic hardware module that can be 
used as a building block of smart networks with application-specific performance and cost 
requirements. A detailed technological comparison between the optoelectronic design and an 
equivalent advanced electronic MCM implementation was carried out to assess the advantages 
provided by the optoelectronic solution. The results of our work are incorporated into a 
prototype optoelectronic switch currently being built at Bell Laboratories, a division of Lucent 
Technologies. 

On the hardware front, we have collaborated with Bell Laboratories to demonstrate a 2Kbit, 
50Mpage/s, photonic first-in, first-out page buffer based on GaAs/AlGaAs multiple quantum 
well diodes flip-chip bonded to sub-micron CMOS circuits. This photonic chip provided a 
number of breakthroughs in the area of optical interconnects, including: 

• First implementation of a hybrid 850nm GaAs MQW/CMOS VLSI circuit with modulators 
bonded directly on top of active silicon circuits. 

• First demonstration of a hybrid 850nm GaAs MQW/CMOS transimpedance 
transmitter/receiver circuit operating at 375 Mbits/sec with a switching energy of 
approximately 370 femtojoules.

• Design and implementation of a high-density 2Kbit photonic first-in, first-out page buffer 
circuit with optical input and output functionality. 

• Measurement of ring oscillator circuits loaded with hybrid MQW devices operating at 
2 GHz.
2. Smart Network Architecture

This section describes three multistage interconnection network (MIN) architectures that we 
have developed for use with optoelectronic technology. These architectures target different 
applications and therefore have distinct cost, performance and functionality characteristics. All 
the architectures can be built using a common hardware module. This module is a perfect 
shuffle MIN that utilizes simple processing elements (less than 200 logic gates per PE). 
Previously, we have shown that optoelectronic (e.g. smart-pixel) implementation of such 
modules is highly efficient, while electronic implementations suffer in scalability due to the large 
number and length of the wire crossovers required in the perfect shuffle interconnection.
Our approach allows a single optoelectronic packaging scheme to be applied in several 
applications, thus lowering the hardware development cost and increasing the number of 
potential users. 

2.1. Introduction 

In the past decade much interest has been generated in the use of self-routing multistage 
interconnection networks (MINs) for high-performance packet-switched interconnection 
networks for telecommunications in the form of asynchronous transfer mode (ATM) switches 
and internal networks for massively parallel computers 3,4 . The basic appeal of MINs lies in their 
implicit simplicity and their scalability to a large number of ports. Unfortunately, the scalability 
potential of electronic implementations of these networks is often overshadowed by physical 
packaging constraints in the form of limited chip pin-outs, connector limitations on PCBs,
backplane interconnection density, and for high-speed systems, signal integrity and latency 
characteristics 2,5 . 

Previously, a number of researchers have proposed that MINs can be efficiently implemented 
using 2-D optoelectronic processor arrays interconnected with free-space optical 
links 6,7,8,9,10,11,12 . Moreover, several research groups are building system prototypes of such
optoelectronic MINs 13 . For example, we have designed an optoelectronic perfect shuffle MIN 
and shown that this design outperforms both chip-level and multichip module (MCM) level 
electronic implementations 2,14 . 

In this program, we have used our previous hardware design as the building block (or module) 
for three interconnection network architectures with distinct application, performance and cost 
requirements. The first architecture, called the tandem banyan network, provides one-to-one 
communication and is well suited for uniform traffic (i.e. all output ports are equally likely to 
be used). While this architecture achieves low cost and low latency, it has very poor
performance in the presence of "hot-spot" traffic (where some output ports are more likely to 
be used) and it does not support one-to-many (or broadcast) communication. The second 
architecture, called the smart network, solves these problems at the expense of higher hardware 
cost and higher latency. The third architecture, called the hierarchical network, reduces the 
smart network latency at the expense of higher hardware cost. 

2.2. Application Requirements 

This section describes the application targeted by the architectures presented in this paper. 
Nominally, this application is ATM, which is emerging as a standard networking scheme for
high-speed computer interconnection both at LAN and WAN levels 15 . In the near future, ATM 
networks will carry traditional computer data traffic, such as data files and email messages, as 
well as video, voice and data traffic associated with distributed computation performed by a 
group of processors attached to the network. 

The experimental switch activities of telecommunications vendors are currently focused (for the 
most part) on electronic ATM switch systems providing small numbers of ports (typically 16-64) 
operating at 155 and 622 Mbps each. Some Japanese telecommunication vendors have 
demonstrated switches operating at 1.8 Gbps and 2.5 Gbps implemented with GaAs ICs and 
advanced MCM packaging technologies 16,17,18 . All these systems are typically based on the
crossbar, shared memory, or shared medium (e.g. bus and ring) switch architectures. While 
these architectures are adequate for today's networking applications, scaling them to meet 
future switching demands will present a formidable challenge. The challenge is to efficiently 
implement switches with a large number of physical ports (1K-10K ports) operating at gigabit
data rates (1-10 Gbps/port) and having 1 to 10 terabit/second aggregate bandwidth capacities. 

There are substantial engineering tradeoffs to take into consideration when deciding on a 
switch architecture that has to scale to over 1000 physical ports and operate at gigabit port data 
rates. Physical packaging issues become very important. Technologies, architectures and 
systems which have worked well for a 64 port switch operating at 155 Mbps are often 
impractical for a 1000 port switch operating at 1 Gbps. For example, both the interconnect and 
circuit complexity of a crossbar switch architecture grow as O(N^2), making it impractical for 
network sizes of 1000 and above. Likewise, both shared memory and shared medium 
architectures suffer from performance degradation as the number of channels is increased.
Figure 1. A switching network consists of the switch fabric, line interfaces and
controller. 

Figure 1 shows a typical switching network architecture where the network nodes (e.g. 
processors, memories, or specialized devices) are attached to the interconnection network (or 
switch fabric) via line interfaces. On the input side, line interfaces segment incoming data 
messages into fixed size cells (or packets) for transmission through the switch fabric. The 
self-routing switch fabric routes these cells between input and output ports. On the output side, line
interfaces reassemble the original data message. In addition to cell segmentation and 
reassembly, line interfaces also incorporate I/O interface, cell buffering, and network protocol 
functions. Finally, the system controller is used for higher level functions such as network 
management and testing. 

Although line interfaces are an important part of a complete network design, the architectures 
described in this paper implement the switch fabric portion of the switching network. Design 
issues for switch fabrics include types of provided communication services, cell blocking (or 
cell loss rate), guaranteed cell delivery, latency and cell priority. Typically, switch fabrics are 
engineered to meet application-specific performance and cost requirements. For example, in this 
paper we describe three switch fabric architectures that target distinct performance and cost 
requirements. 

The two types of cell blocking that can occur in ATM switch fabrics are internal link blocking 
and output port blocking. Internal link blocking occurs for switch fabrics that cannot support all 
possible interconnections. In this situation, it is possible that two cells will simultaneously 
compete for the same link and one of the cells has to be discarded or buffered for later 
transmission. Output port blocking is unavoidable in self-routing switch fabrics because several 
input ports can simultaneously send a cell to the same output port. Typically, networks are 
engineered to allow small amounts of blocking (e.g. <10^-9) for a given distribution of incoming
traffic (e.g. uniform, community of interest, bursty, etc.). 
Guaranteed cell delivery is critical for multimedia applications that require a sustained
bandwidth to be maintained between the network devices at any time during the connection. 
This is closely related to cell priority which allows cells with higher priority to have precedence 
over cells with lower priority (e.g. high priority traffic is delivered first) when blocking occurs. 
Finally, latency is important when the switch fabric is used for distributed computing, in which 
case lower communication latency yields more efficient parallel processing. 

Cell traffic through the switch fabric can be divided in two categories: communication traffic 
and synchronization traffic. Communication traffic transfers cells between input and output 
ports. Typical communication traffic consists of one-to-one and one-to-many (e.g. multicast or 
broadcast) cell traffic. In one-to-one traffic, cells are sent from a source port to a single
destination port. In multicast traffic, a source port simultaneously sends the same cell to many 
destination ports. Synchronization traffic occurs in distributed applications 19 where a software 
program is partitioned into a set of cooperating processes that run concurrently on different 
processors and communicate using message-passing over the interconnection network. Unless 
properly handled, synchronization traffic can lead to "hot-spot" network congestion 20 as 
described in the following example. 

To illustrate synchronization traffic, consider a parallel implementation of a loop with M 
iterations, followed by a sequential code portion. We can have M processors executing the M 
iterations of the loop in parallel, but the sequential portion of the code has to wait until all M 
processors are finished. In a shared memory computer, this type of synchronization is 
implemented by having each processor increment a shared memory variable. The processor 
containing the serial code checks the variable to decide when it can execute. The problem arises 
when all the processors finish and send M messages to increment the same shared variable. 
Since the interconnection network has only one output port to the memory containing the 
shared variable, the updates must be done serially, creating a performance bottleneck. This 
phenomenon is called the synchronization bottleneck 21 (or MSYPS limit). 

One approach to eliminating the synchronization bottleneck is not to parallelize the code that 
requires extensive use of synchronization operations. This approach cannot be used in 
distributed computing, because synchronization operations are inherent in distributed systems 
and are used for parallel resource scheduling and allocation 22 . Thus a method of efficiently 
performing synchronization has to be implemented in the network hardware to allow 
high-performance distributed computing. 
Figure 2. Optoelectronic hardware module used to construct the MIN architectures presented in
this paper. Figure A shows the schematic diagram of the design, while figures B and C show 
the physical implementation using optically interconnected multichip modules (reference 2). 
2.3. Optoelectronic Hardware Module

Figure 2 shows the optoelectronic hardware module that is used as a building block of the 
architectures described in this paper. It consists of optoelectronic processor arrays connected in 
tandem using free-space optical links. The free-space links always use a 2-D shuffle exchange
topology. Each processor array contains identical and simple processing elements (less than 200 
logic gates/PE) uniformly spaced on a 2-D grid and having optical input and output ports. In 
our design approach, the logic function performed by the processing elements is architecture 
dependent, while the optical interconnection remains fixed. For example, although the 
architectures described in this paper use 8 distinct processor array types, they all rely on the 
same 2-D optical shuffle exchange to interconnect the arrays. This approach allows the same 
optoelectronic packaging scheme to be used to build the entire interconnection network thereby 
reducing fabrication and design costs. 

The detailed hardware design of this module was previously described in references 2 and 14 
and will not be repeated here. Cascading log_K N module stages produces a network that is 
functionally equivalent to an N channel perfect shuffle MIN, where K is a design parameter. 
The perfect shuffle MIN, shown in figure 3, uses log_2 N stages of switching elements. Each
stage contains N/2 switching elements with 2 input and 2 output ports (see figure 4). Cells enter 
the switching elements in a particular stage, bit and frame aligned. Each switching element 
receives two incoming cells, examines the information contained in their cell headers, and 
routes them to the appropriate output port. 
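
The self-routing behavior of the perfect shuffle MIN can be sketched in a few lines of Python. This is only an illustration, assuming destination-tag routing in which each stage examines one destination address bit, most significant bit first; the source does not state the bit order, and the function names are hypothetical.

```python
def perfect_shuffle(position, n_bits):
    """Rotate the n_bits-wide port index left by one bit (the perfect shuffle)."""
    mask = (1 << n_bits) - 1
    return ((position << 1) | (position >> (n_bits - 1))) & mask


def route(source, destination, n_bits):
    """Return the port index occupied by a cell after each of the log_2 N stages."""
    path = []
    position = source
    for stage in range(n_bits):
        # Fixed optical interconnection between processor arrays.
        position = perfect_shuffle(position, n_bits)
        # The 2x2 switching element selects its upper (0) or lower (1) output
        # from one destination bit, most significant bit first (assumed order).
        routing_bit = (destination >> (n_bits - 1 - stage)) & 1
        position = (position & ~1) | routing_bit
        path.append(position)
    return path


if __name__ == "__main__":
    # A cell entering port 5 of a 16 port network, addressed to port 3.
    print(route(5, 3, 4))
```

After log_2 N stages every bit of the source index has been shifted out and replaced by a destination bit, so the cell arrives at the addressed port regardless of where it entered.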

2.4. The Tandem Banyan Network 

The tandem banyan MIN architecture was originally developed for electronic implementation 
(see figure 5). The basic idea behind this architecture is to repeat cell routing through a banyan 
network and after each routing attempt, remove cells that have been successfully routed. The 
tandem banyan network can be built using our optoelectronic hardware module because it is 
based on a topology that is equivalent to the perfect shuffle. The tandem banyan MIN provides 
one-to-one communication and cell priority services (e.g. higher priority cells have lower 
latency). It is well suited for local area computer networks, because of relatively low latency 
and low cost. A detailed description of the tandem banyan design and performance can be 
found in reference . 
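
The routing idea above (attempt routing, remove delivered cells, retry in the next banyan) can be sketched as a small simulation. This is a simplified model under stated assumptions, not the design from the cited reference: each banyan is modeled as a shuffle-exchange network with most-significant-bit-first routing, a cell that loses a conflict is marked misrouted and yields to correctly routed cells for the rest of that banyan, ties go to the upper input, and priority bits and output buffering are ignored. The function name is hypothetical.

```python
def tandem_banyan_loss(destinations, num_banyans):
    """Fraction of cells still undelivered after num_banyans banyans in tandem.

    destinations[i] is the output port requested by the cell entering port i.
    """
    n_ports = len(destinations)
    n_bits = n_ports.bit_length() - 1
    assert 1 << n_bits == n_ports, "port count must be a power of two"
    cells = dict(enumerate(destinations))  # position -> destination
    for _ in range(num_banyans):
        # Each entry is (destination, misrouted flag); flags reset per banyan.
        state = {pos: (dest, False) for pos, dest in cells.items()}
        for stage in range(n_bits):
            shuffled = {}
            for pos, cell in state.items():
                q = ((pos << 1) | (pos >> (n_bits - 1))) & (n_ports - 1)
                shuffled[q] = cell
            state = {}
            for elem in range(n_ports // 2):
                arrivals = [shuffled[q] for q in (2 * elem, 2 * elem + 1)
                            if q in shuffled]
                arrivals.sort(key=lambda cell: cell[1])  # routed cells choose first
                free = [0, 1]
                for dest, misrouted in arrivals:
                    wanted = (dest >> (n_bits - 1 - stage)) & 1
                    if misrouted or wanted not in free:
                        out, misrouted = free[0], True  # take whatever is left
                    else:
                        out = wanted
                    free.remove(out)
                    state[2 * elem + out] = (dest, misrouted)
        # Delivered cells leave the network; misrouted cells try the next banyan.
        cells = {pos: dest for pos, (dest, misrouted) in state.items() if misrouted}
    return len(cells) / n_ports


if __name__ == "__main__":
    import random
    rng = random.Random(1)
    traffic = [rng.randrange(64) for _ in range(64)]  # uniform traffic, 100% load
    for r in (1, 2, 4, 8):
        print(r, tandem_banyan_loss(traffic, r))
```

Because undelivered cells are only ever removed, never added, the loss computed for a given traffic pattern cannot increase as more banyans are placed in tandem.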
Figure 5. The tandem banyan network for R=2 and N=16.

Figure 6 shows the cell loss rate (defined as the probability of a cell being misrouted due to 
internal link blocking) for a 1024 port tandem banyan network as a function of the number of 
banyans in tandem (R). It can be seen that the cell loss rate can be made arbitrarily small by 
increasing R. For example, for R=8, the cell loss rate is near 10^-5 and the number of stages is 
80. Assuming that each banyan stage has a latency of 3 clock cycles (i.e. 1 cycle for the 
activity bit, 1 cycle for the priority bit, and 1 cycle for the routing bit), the worst case 
latency of an R=8 tandem banyan network is 240 clock cycles (i.e. 
8 banyans × 10 stages/banyan × 3 cycles/stage). On the other hand, the best case latency is 30 clock 
cycles (i.e. 1 banyan × 10 stages/banyan × 3 cycles/stage). The average latency is 90 clock cycles 
(or 3 banyans in tandem) as determined by computer simulation.
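
The latency figures quoted above follow from a single product, reproduced here as a minimal sketch. The function and parameter names are illustrative; the 3 cycles per stage is the assumption stated in the text.

```python
def tandem_banyan_latency(n_ports, banyans_traversed, cycles_per_stage=3):
    """Clock cycles for a cell that leaves after the given number of banyans."""
    stages_per_banyan = n_ports.bit_length() - 1  # log_2 N for a power of two
    return banyans_traversed * stages_per_banyan * cycles_per_stage


if __name__ == "__main__":
    print(tandem_banyan_latency(1024, 8))  # worst case for R=8
    print(tandem_banyan_latency(1024, 1))  # best case
    print(tandem_banyan_latency(1024, 3))  # simulated average of 3 banyans
```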

A major shortcoming of the tandem banyan network is its inability to handle "hot-spot" traffic. 
Figure 7 shows the curves for cell loss rate of a 1024 port tandem banyan network where 5% and
10% of all incoming cells are directed to a single output port while the remaining cells are 
uniformly distributed. It can be seen that the cell loss rate with "hot spot" traffic is much higher 
than the cell loss rate with uniform traffic (superimposed on the same plot). In fact, the 
"hot-spot" cell loss rate saturates near 1(H even as R is increased to 10. This leads us into the next
section, which enhances the tandem banyan network to resolve "hot-spot" traffic that arises due 
to synchronization operations. 

2.5. The Smart Network 

This section describes a new MIN architecture, called the smart network. The smart network 
architecture retains all the capabilities of the tandem banyan MIN while improving performance 
under "hot spot" traffic conditions and allowing broadcast communication. These additional 
capabilities are achieved at the expense of increased hardware costs and higher latency. The 
smart network is constructed by repeatedly cascading the basic optoelectronic hardware module 
described in section 2.3. Unlike the tandem banyan network, which uses a single processor array 
chip design, the smart network requires 7 different processor array designs.
Figure 6. Cell loss rate under uniform traffic for N=1024 port tandem banyan network. The
vertical axis is the cell loss rate, while the horizontal axis is R, the number of banyans in 
tandem. Input link load is 100%. 
Figure 7. Cell loss rate under "hot-spot" traffic for N=1024 port tandem banyan network. The
two topmost curves show cell loss rate with 5% and 10% "hot spot" traffic. The lowest curve 
shows uniform traffic cell loss rate and is included as a reference. Input link load is 100%. 

In the next section we describe the types of synchronization operations that can be efficiently 
handled by the smart network. These operations are typical of distributed computing and would 
cause severe "hot-spot" congestion if carried over the tandem banyan network. 

2.5.1. Smart Network Operations 

Four types of operations are supported in the smart network architecture. These operations 
include one-to-one communication, broadcast communication and two types of synchronization 
operations. The first operation, one-to-one communication, is implemented using the tandem 
banyan network (see section 2.4) and will not be discussed here.

The broadcast operation is implemented as a user-initiated service 24 . With this approach, the 
originator of the broadcast transmits the master packet. This packet contains the data to be 
broadcast. At the same time, the recipients of the broadcast send copy packets into the 
network. Inside the network, the contents of the master packet are copied into the copy 
packets. Finally the copy packets are delivered back to their original senders. In other words, 
the input ports that participate in a broadcast operation, simply send packets to themselves. One 
packet is designated the master packet while the rest are called the copy packets. The smart 
network copies the contents of the master packet into the copy packets while the messages are 
en-route to their destination. 

To allow multiple broadcast transmissions to occur simultaneously with one-to-one 
communication, the smart network provides special packet header fields called the group 
address and the COPY instruction fields. With this approach, multiple master packets can be 
simultaneously transmitted provided that they use unique group address values. To receive a 
copy of a specific master packet, the input port simply sends itself a packet containing the group
address of the master packet that it wishes to copy. To differentiate broadcast and one-to-one 
traffic, all broadcast packets contain 1 in the COPY instruction field. On the other hand, all 
one-to-one packets contain 0 in that field. 
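
The effect of the broadcast service on a batch of packets can be sketched as follows. This models only the end result (copy packets leave carrying the master payload), not the stage-by-stage hardware. The `is_master` flag is a hypothetical stand-in, since the text does not say how the master packet is marked in the header.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    destination: int   # broadcast participants address themselves
    group: int         # group address field
    copy: int          # COPY instruction field (1 = broadcast, 0 = one-to-one)
    is_master: bool    # hypothetical marker for the master packet
    payload: bytes


def broadcast(packets):
    """Copy each master payload into the copy packets of the same group."""
    masters = {p.group: p.payload
               for p in packets if p.copy == 1 and p.is_master}
    delivered = []
    for p in packets:
        if p.copy == 1 and not p.is_master and p.group in masters:
            p = replace(p, payload=masters[p.group])
        delivered.append(p)  # one-to-one packets pass through unmodified
    return delivered


if __name__ == "__main__":
    batch = [
        Packet(0, 5, 1, True, b"data"),   # originator of the broadcast
        Packet(1, 5, 1, False, b""),      # recipient sending itself a copy packet
        Packet(2, 0, 0, False, b"x"),     # ordinary one-to-one traffic
    ]
    for packet in broadcast(batch):
        print(packet)
```

Keying the lookup on the group address is what lets several broadcasts proceed at once, provided each master uses a unique group value.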

The remaining operations, fanin and partial-sum, are similar to broadcast transmission. Both of 
these operations use the group address and instruction fields in the packet header. Again, the 
input ports that participate in the operation, simply send packets to themselves. Their group 
address field is set to a unique and predetermined number while the appropriate instruction field 
is set to 1. The smart network then performs the requested synchronization operation while the 
messages are en-route to their destination. 

The fanin operation allows packets that are sent to the same destination output port to be 
combined inside the interconnection network such that only one packet is delivered at the 
output. This operation is useful in distributed computing because many parallel algorithms 
depend on barrier synchronization 14 , which requires that all the processors involved in the
computation send a completion status message to a specific processor to determine whether a 
solution has been found. If the fanin operation is not implemented in the network hardware, 
then packets sent to the same output port are delivered sequentially because a network output
port can only accept one packet at a time. Thus when a large number of packets are sent to the 
same output port, a serious performance bottleneck occurs. In order to combine the packets, 
one needs to specify the mathematical function that is to be performed on the packet payloads 
when packets are combined. Functions useful for synchronization purposes are AND, OR, 
MAX and MIN. 
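
As a minimal sketch of what fanin computes, the payloads of all packets addressed to one output port are reduced to a single value with the selected function. The dictionary of combiners and the function name are illustrative; in the network itself the reduction is spread across the switching stages rather than performed in one place.

```python
from functools import reduce
import operator

# The four combining functions named in the text.
COMBINERS = {
    "AND": operator.and_,
    "OR": operator.or_,
    "MAX": max,
    "MIN": min,
}


def fanin(payloads, function):
    """Combine the payloads of packets bound for one output port into one value."""
    return reduce(COMBINERS[function], payloads)


if __name__ == "__main__":
    # Barrier synchronization: each processor reports 1 when it has finished.
    print(fanin([1, 1, 0, 1], "AND"))  # 0 while any processor is still running
    print(fanin([1, 1, 1, 1], "AND"))  # 1 once all have finished
```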

The partial sum operation allows the implementation of the fetch-and-add synchronization 
operation, which has been found useful for many applications in distributed computing 25 . The
basic idea behind this operation is that the processors send a packet containing their number 
into the network and receive a packet that contains the partial sum of the numbers. A detailed 
description of the fetch-and-add operation and its usage can be found in reference 14. The 
absence of the partial sum operation would lead to a serious performance bottleneck, especially 
when a large number of processors are involved.
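
The relationship between partial sums and fetch-and-add can be sketched with a running sum. Whether each processor receives the sum including or excluding its own number is not stated in the text, so both forms are shown; the function names are hypothetical.

```python
from itertools import accumulate


def partial_sums(values):
    """Inclusive running sums: packet i receives values[0] + ... + values[i]."""
    return list(accumulate(values))


def fetch_and_add(initial, increments):
    """Conventional fetch-and-add semantics built from the same running sum.

    Returns the value each processor fetches (the shared variable before its
    own increment is applied) and the final value of the shared variable.
    """
    running = list(accumulate(increments, initial=initial))
    return running[:-1], running[-1]


if __name__ == "__main__":
    print(partial_sums([1, 2, 3]))
    print(fetch_and_add(10, [1, 2, 3]))
```

All M results are produced in one pass through the network, instead of M serialized updates at a single memory port.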

2.5.2. Smart Network Architecture 

Previously, a shuffle-based MIN architecture has been developed to support synchronization 
operations and to reduce internal blocking for the NYU ultracomputer project 26 . This 
architecture uses a bi-directional perfect shuffle MIN. Complex switching elements implement 
the necessary logic for performing synchronization operations and provide packet buffering in 
case of internal contention. Although this architecture is well suited to VLSI implementation, it 
is not efficient with our optoelectronic hardware module. As shown in reference 14, the use of 
complex switching elements in the optoelectronic MIN leads to low system performance and 
high hardware cost. 

On the other hand, in the field of telecommunications a non-blocking interconnection network, 
called the STARLITE 8,10 , has been developed. The STARLITE supports packet priority, 
one-to-one communication and broadcast communication. As shown in figure 8, the basic
STARLITE design uses 4 cascaded MINs. The first three MINs (group sorting, copy and mark 
networks) implement broadcast communication and packet priority services. The fourth MIN is 
a batcher-banyan network that implements non-blocking one-to-one communication service. 
Although the STARLITE does not support synchronization services, it is well suited for 
optoelectronic implementation because it uses simple switching elements interconnected with 
the perfect shuffle topology. 

To enable STARLITE to perform synchronization services, we add four additional MINs and 
replace the batcher-banyan MIN with the tandem banyan MIN. The resulting network, called 
the smart network, is shown in figure 8. The tandem banyan MIN portion of the smart network 
is called the communication section because it handles one-to-one communication traffic. The 
remaining group of MINs are called the processing section because they handle synchronization 
operations and one-to-many communication traffic. 

The total number of stages in the smart network is (log_2 N)^2 + (4 + R) · log_2 N + 1, where R is the 
tandem banyan parameter described in section 2.4. For example, for a 1024 port smart network 
with cell loss rate of 10^-5, R is set to 8 and the total number of stages is 221. The worst-case 
latency of this network is 381 clock cycles where 240 clock cycles are spent in the tandem 
banyan network and the latency in the remaining networks is 1 clock cycle per stage.
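
The stage count and worst-case latency can be checked with a short calculation. This sketch assumes the first term of the stage formula is (log_2 N)^2, which is the reading that reproduces the quoted totals of 221 stages and 381 clock cycles for N=1024 and R=8. The function names are illustrative.

```python
def smart_network_stages(n_ports, r):
    """Total stages: (log_2 N)^2 + (4 + R) * log_2 N + 1 (assumed form)."""
    n = n_ports.bit_length() - 1  # log_2 N for a power of two
    return n * n + (4 + r) * n + 1


def smart_network_worst_latency(n_ports, r, banyan_cycles_per_stage=3):
    """Worst-case clock cycles: 3 per tandem banyan stage, 1 per other stage."""
    n = n_ports.bit_length() - 1
    banyan_stages = r * n
    other_stages = smart_network_stages(n_ports, r) - banyan_stages
    return banyan_stages * banyan_cycles_per_stage + other_stages


if __name__ == "__main__":
    print(smart_network_stages(1024, 8))
    print(smart_network_worst_latency(1024, 8))
```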



Figure 8. STARLITE and smart network architectures (panels: STARLITE network, smart 
network). The smart network adds additional stages to the STARLITE and replaces the 
batcher banyan with the tandem banyan.

In the smart network, packets use a complex header format with log_2 N bits allocated for group 
address, 1 bit for group priority, 6 bits for synchronization instructions, 1 bit to indicate if the 
packet contains active payload, 1 bit for packet priority, and log_2 N bits for destination address. 
The total size of the smart network packet header is 9 + 2·log_2 N bits. For example, a 1024
port smart network will require a 29 bit packet header. The size and format of the payload field 
of the smart network packet can be determined by the user. For example, to carry standard 
ATM payloads of 48 bytes, it should be set to 384 bits. 
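
The header size follows directly from the field widths listed above, as this short sketch shows. The field names are illustrative labels, and listing them in this order does not imply that this is the order of transmission.

```python
def header_fields(n_ports):
    """Field widths in bits for the smart network packet header."""
    address_bits = n_ports.bit_length() - 1  # log_2 N for a power of two
    return {
        "group_address": address_bits,
        "group_priority": 1,
        "sync_instructions": 6,
        "active": 1,
        "packet_priority": 1,
        "destination_address": address_bits,
    }


def header_bits(n_ports):
    """Total header size, equal to 9 + 2 * log_2 N bits."""
    return sum(header_fields(n_ports).values())


if __name__ == "__main__":
    atm_payload_bits = 48 * 8
    print(header_bits(1024))                     # header only
    print(header_bits(1024) + atm_payload_bits)  # header plus one ATM payload
```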

The group and instruction fields in the packet header specify the synchronization operation to 
be performed on the packets. For packets that require no internal network processing (as in the 
case of one-to-one communication traffic), these fields are set to 0 and the packet moves 
through the processing section of the smart network without being modified. Then, it enters the 
tandem banyan MIN where it is routed to the output port specified in the destination address 
field. Thus for one-to-one communication traffic, the smart network functions as a tandem
banyan network with extra latency for moving packets through the processing MINs. 

On the other hand, when packets do require synchronization operations, the group and 
instruction fields in the packet header must be set accordingly. For example, consider the case 
where M processors attached to the smart network request a partial sum operation. First, each 
processor transmits a packet with the same predetermined number in the group address field 
and their own address in the destination address field. In addition, the ADD instruction field is 
set to 1 while other instruction fields are set to 0. The payloads contain the numbers to be 
added. The first MIN, the group sorting network, groups the packets based on their group 
address field. The second network, the ADD network, computes the partial sums for
those packet groups that have the ADD instruction field set to 1. These networks will compute 
the partial sum for our M packets. Since these packets do not require other synchronization 
operations, they move through the remaining processing networks without being modified. 
Finally, the tandem banyan network delivers the M packets to their destinations (e.g. back to 
the senders). 



Figure 9. Design complexity for the seven types of switching elements used in the smart 
network (Batcher, tandem banyan, Add, Max, And, Mark, Copy). These results were obtained 
using VHDL synthesis.

The STARLITE architecture uses the perfect shuffle interconnection topology. The additional 
MINs required to implement the smart network architecture also rely on the perfect shuffle 
interconnection topology with the addition of some local near-neighbor interconnections in 
every stage. These local interconnections can be efficiently implemented using local electrical 
wires on the optoelectronic chips. As shown in figure 9, the 7 types of switching elements used 
in the smart network are simple, requiring less than 200 logic gates. Thus the smart network 
architecture can be efficiently implemented with our optoelectronic hardware module. The next 
section describes the detailed design of the smart network architecture. 

2.5.3. Detailed Architecture Design 
A clock-accurate, gate-level design of the smart network architecture has been developed and
verified using VHDL simulation and synthesis tools. The remainder of this section describes the 
operation of the various networks that make up the smart network and details their design. 

The first MIN in the smart network is the group sorting network 27 . This network positions 
packets that belong to the same group (e.g. have the same number specified in the group 
address field of the packet header) next to each other. This allows the processing networks that 
follow the sorting network to efficiently perform synchronization operations and broadcast 
communication. An example of a sorting network is shown in figure 10. 



Figure 10. Example of a sorting network that orders packets by group address.


The sorting network operates on the group address and group priority fields in the packet 
header. It is constructed using (log_2 N)^2 stages of the perfect-shuffle interconnection 
network 10,28,29 . Each stage contains N/2 sorting nodes with 2 input and 2 output ports. The 
sorting node compares the group address fields of the two incoming cells and sends the lowest
numbered cell to a predetermined output port. Figure 12(a) shows the gate-level design of the 
switching element for the batcher sorting network. Since the design of sorting networks is 
widely known it will not be discussed further. 

An important feature of the sorting network is that it uses both group address and group priority 
fields in the sorting process. Thus, within a group, packets are ordered by priority. This 
provides a method of controlling the order in which packets in a group are processed. For 
example, consider the partial sum operation (see section 2.5.1). Packets with higher priority will
be added before packets with lower priority. Thus higher priority packets will end up having 
smaller partial sum values. In an application such as a parallel queue implementation 3 , this 
mechanism can be used to implement priority services. 
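
The behavior of a single sorting node can be sketched as a compare-exchange on a two-part key. This is an illustration only, not the gate-level design: it assumes that a larger priority value means higher priority, that the lower-keyed packet leaves on output 0, and that packets are represented as dictionaries with hypothetical field names.

```python
def sort_key(packet):
    """Order by group address first, then higher priority first within a group."""
    return (packet["group"], -packet["priority"])


def sorting_node(upper, lower):
    """2x2 compare-exchange: return (output 0, output 1) for two incoming packets."""
    first, second = sorted((upper, lower), key=sort_key)
    return first, second


if __name__ == "__main__":
    a = {"group": 7, "priority": 0, "payload": 4}
    b = {"group": 7, "priority": 1, "payload": 9}
    # Same group, so the higher priority packet b is placed first.
    print(sorting_node(a, b))
```

Placing higher priority packets first within a group is what causes them to be added earlier by the ADD network and therefore to receive the smaller partial sums.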
The group sorting network is followed by five processing networks that perform 
synchronization and broadcast operations. Four of these networks are constructed using log_2 N 
stages of the perfect-shuffle interconnection network. The fifth network consists of a single 
stage that implements special internal processing. Each stage contains N processing nodes with 
2 input and 1 output ports. Additionally, each stage includes local connections between the 
processing nodes as illustrated in figure 11. There are five different processing node types (i.e. 
one type for every processing network). To perform their work, processing networks must
examine group address, group priority and synchronization instruction fields in the packet 
header. The input data and the results of synchronization operations are stored in the packet 
payload. 



Figure 11. Interconnection network for the ADD, MAX, AND and COPY networks. This 
figure shows the ADD network that performs the partial add operation. Note the extra 
local links between switching elements. 

A common feature of all the processing networks is the method used by the processing nodes 
to determine if the two incoming packets need to be modified. To perform this function, the 
processing node compares the group address fields of the two packets. If the group addresses
do not match then the topmost packet is passed to the output port without modification. On the 
other hand, if the group addresses are identical, the processing node then checks that both 
packets have the appropriate synchronization instruction field set to 1. For example, the 
processing nodes in the COPY network check that both incoming packets have the copy 
instruction field set to 1. When this condition is met, the appropriate operation is performed on 
the payloads of the incoming packets and the topmost packet with a newly computed payload is 
passed to the output port. Otherwise, the topmost packet is again passed to the output port 
without modification. 
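
The decision rule above can be sketched as a small function. This is an illustrative software model, not the gate-level node of figure 12: the `Packet` layout, the representation of the instruction fields as a set of names, and the `op` callback are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    group: int        # group address field
    instr: frozenset  # names of the synchronization instruction fields set to 1
    payload: int


def processing_node(top, bottom, instr_name, op):
    """Return the packet that a processing node sends to its single output port."""
    if top.group != bottom.group:
        return top  # different groups: topmost packet passes unmodified
    if instr_name in top.instr and instr_name in bottom.instr:
        # same group and both packets request the operation: combine payloads
        return replace(top, payload=op(top.payload, bottom.payload))
    return top      # same group, operation not requested by both packets


a = Packet(3, frozenset({"MAX"}), 4)
b = Packet(3, frozenset({"MAX"}), 9)
max_result = processing_node(a, b, "MAX", max)
```

Passing `max`, an integer addition, or a bitwise AND as `op` gives the behavior of a MAX, ADD, or AND node respectively under these assumptions.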
The first processing network is the ADD network. This network implements the partial sum
operation as described in section 2.5.1. Each processing node in this network includes the
bit-serial adder required to compute the partial sum. At the output of this network, groups of
packets that request the partial sum operation will contain the partial sum values in their 
payloads. The processing node for the ADD network is shown in figure 12(b). 

The second processing network is the MAX network. This network implements the fanin 
operation based on the MAX numeric function as described in section 2.5.1. Each processing 
node in this network includes a bit-serial comparator. During a MAX operation, this comparator 
is used to compare the payloads of the two packets. Then, the largest payload value is copied 
to the output port. The result of the MAX operation is to copy the largest payload value within 
the group into the packet at the top of the group. The processing node for the MAX network is 
shown in figure 12(c). 
The third processing network is the AND network. This network implements the fanin
operation based on the AND logical function as described in section 2.5.1. Each processing
node in this network includes an AND gate. During an AND operation, this gate is used to
perform a bitwise AND on the payloads of the two input packets. The result is then copied to
the output port. The effect of the AND operation is to copy the bitwise AND of all payloads in
the group into the packet at the top of the group. The processing node for the AND network is 
shown in figure 12(d). 

The last two processing networks are the COPY and MARK networks. These two networks
combine to implement the fanout operation (or broadcast communication) as described in 
section 2.5.1. The MARK network is used to identify and mark the master packet within a 
group. The COPY network is used to copy the contents of this master packet into other 
packets within that group. The MARK network consists of a single stage. The processing nodes 
for the two networks are shown in figures 12(e) and 12(f) respectively. The operation of the 
MARK and COPY networks has been previously described in the STARLITE design and will 
not be discussed further. 

The last portion of the smart network is concerned with routing the cells to their final 
destination. This is accomplished by the familiar tandem banyan network described in section 
2.4. This network uses the activity, destination address and priority fields in the packet header 
to route packets. 

2.6. The Hierarchical Network 

The smart network excels at performing synchronizations; however, this performance comes at
the expense of higher latency for one-to-one communication traffic. For example, consider a 
1024 channel smart network. This network has 221 stages (100 stages for the sorting network, 
41 stages for the processing networks, and 80 stages for the tandem banyan network). The 
lowest latency of the smart network is 171 clock cycles, compared with 30 clock cycles 
required in the tandem banyan network. The average latency of the smart network is 231 clock 
cycles versus 90 clock cycles required in the tandem banyan network. 
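
The stage and latency figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes 3 clock cycles per tandem banyan stage (the switching element latency noted in section 2.7) and 1 clock cycle per sorting or processing stage; the latter is an inference that happens to reproduce the quoted numbers, not a figure stated in the text. It also assumes that the best case exits after the first banyan ($\log_2 N$ stages) and the worst case traverses all 80 stages.

```python
import math

N = 1024
SORT_STAGES = 100
PROC_STAGES = 41
TANDEM_STAGES = 80
CYCLES_PER_TANDEM_STAGE = 3  # tandem banyan switching element latency
CYCLES_PER_OTHER_STAGE = 1   # assumption for sorting/processing stages
TANDEM_AVERAGE = 90          # average tandem banyan latency quoted in the text

smart_stages = SORT_STAGES + PROC_STAGES + TANDEM_STAGES

banyan_stages = int(math.log2(N))  # stages in a single banyan
tandem_best = banyan_stages * CYCLES_PER_TANDEM_STAGE
tandem_worst = TANDEM_STAGES * CYCLES_PER_TANDEM_STAGE

# Cycles spent in the sorting and processing networks before routing starts.
front_end = (SORT_STAGES + PROC_STAGES) * CYCLES_PER_OTHER_STAGE
smart_best = front_end + tandem_best
smart_average = front_end + TANDEM_AVERAGE
smart_worst = front_end + tandem_worst

print(smart_stages, tandem_best, smart_best, smart_average, smart_worst)
```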

The higher latency is acceptable in wide area networks, where the processors are separated by 
long distances and therefore the signal flight time delay is the dominant latency parameter. In 
local area networks, this latency may not be acceptable. Here we describe a hierarchical 
network architecture that combines the smart network with the tandem banyan network to 
achieve both hardware-assisted synchronization operations and low-latency one-to-one
communication. The concept of hierarchical networks has been previously proposed in 
reference 31. 

Figure 13 shows the architecture of the hierarchical network. The basic idea is to separate 
incoming traffic into one-to-one communication and synchronization (including multicast) 
traffic. The input controller directs one-to-one communication traffic to be routed by the 
tandem banyan network, while the remaining traffic is directed to the smart network. With this 
approach the average latency for communication traffic is 90 clock cycles, while 
synchronization and multicast traffic have average latency of 231 clock cycles. Thus this 
approach offers good performance to both types of traffic present in distributed computing and




Figure 13. Hierarchical network architecture combines tandem banyan and the smart 
network in a parallel configuration. 


2.7. Conclusions 

In this paper, we have examined three network designs. We developed these designs for 
different applications and thus they have different cost, performance and functionality 
characteristics. A common feature of all these designs is their use of a common optoelectronic 
hardware module as their building block. This module implements the perfect shuffle network 
and has been previously shown to outperform electronic implementations at the chip and MCM 
levels of the packaging hierarchy 2 . 



                                  Tandem Banyan    Smart Network    Hierarchical Network
Network size                      1024             1024             1024
Cell loss rate (uniform traffic)  $10^{-5}$        $10^{-5}$        $10^{-5}$
1-1 comm. operations              yes              yes              yes
Broadcast operations              no               yes              yes
Synchronization operations        no               yes              yes
Number of stages                  80               221              301
Latency (worst-case)              240              381              240 for 1-1 comm., 381 for other comm.
Latency (best-case)               30               171              30 for 1-1 comm., 171 for other comm.


Table 1. Summary of proposed MIN architecture designs. 
Table 1 shows the number of stages and the worst-case and best-case latencies for the proposed
architectures when N is 1024 and the cell loss rate is kept at $10^{-5}$. The number of stages required
for the tandem banyan network must be determined through time-consuming computer
simulation. However, we expect that the relative performance and cost characteristics of these 
architectures will remain unmodified as N is changed. 

A number of issues remain to be solved in our architecture designs. The first issue is to reduce 
the latency in the tandem banyan network. The second issue is to reduce "hot spot" contention 
that results from traffic that is not associated with synchronization operations. It turns out that 
both of these problems can be solved by a simple modification in the tandem banyan design. 
We have done preliminary work in this area and demonstrated dramatic reduction in latency at 
the expense of increased hardware cost 32 . 

Another challenge that remains to be addressed is reduction of the 3 clock cycle latency 
incurred in the tandem banyan switching element. In the current design, the use of longer 
priority fields leads to higher latency. We have also performed initial work in this area, 
demonstrating a design that uses parallel optical channels to transmit the packet. This design 
can reduce the switching element latency to one clock cycle without any appreciable increase in 
hardware cost 33 . 

It might also be argued that the optoelectronic implementation of the proposed networks would 
be prohibitively expensive due to the large number of MIN stages and the high cost of optical
interconnect devices required at every stage. However, it has been shown that a large shuffle
network can be decomposed into many smaller shuffle networks interconnected with the
shuffle topology 34. In reference 14, we used this idea to partition the MIN system,
implementing small electronic shuffles within a single chip and using free-space optical
interconnects to link these electronic shuffles. This approach dramatically reduces the number
of optical stages in the MIN. For example, the tandem banyan discussed in this paper can be
implemented using 16 optical stages instead of the 80 stages mentioned earlier in this paper.

The adoption of reusable generic components for building high-performance optoelectronic 
systems will be critical to the success of this technology. The building block approach allows 
the same packaging scheme and optical interconnect devices to be used in several applications, 
thus leveraging the development cost across a large number of potential users. In this paper we 
have considered an application area of switch fabrics for computer networks and shown that a 
generic optoelectronic hardware module can be used to implement architectures with various 
application-specific requirements.

3. Comparison with Electronics 

In this section we focus on the design of optically interconnected MCMs for gigabit ATM 
switching networks. Our approach is to design a generic hardware module that can be used to
implement ATM switches with application-specific functionality, cost and performance
requirements. The module design is partitioned on the MCM such that it can be built using
VLSI chips interconnected with holographic free-space optical interconnects. Holographic
optical interconnects are also used for inter-MCM communication. A comparison of our 
approach with electrical MCM, all-optical, and guided-wave implementations of ATM switches 
is presented. 

3.1. INTRODUCTION 

The networking industry is undergoing a dramatic change. The increasing popularity of 
distributed workstation computing, metacomputing, multimedia, and video teleconferencing 
applications coupled with the availability of 200+ MIPS workstations and gigabit fiber optic 
links is exhausting the capacity of present switching systems. Future network applications will
require switches with a large number of physical ports (1K-10K) operating at gigabit data rates
(1-10 Gbps/port) and achieving terabit aggregate bandwidth capacities (1-10 Tbps). 

Scaling present electronic switches to meet future networking requirements is a formidable 
challenge. On the technology front, physical packaging constraints of electronics (i.e. cross-talk, 
clock-skew, signal attenuation, limited chip pin-outs, connector limitations on PCB's, etc.) limit 
the connection density-bandwidth product that can be achieved within a switch (see table
1) 35. On the architecture front, current switch designs suffer from
performance and/or cost bottlenecks if scaled beyond several hundred physical ports. 


Switch packaging    State-of-the-art electronics                             Optically interconnected MCMs
level               signal density   data rate    density-speed product      signal density      data rate    density-speed product
switch chip         64 IOs           1000 Mbps    64 Gbps/chip               1024 IOs            1000 Mbps    1024 Gbps/chip
inter-chip          800 lines/cm     500 Mbps     400 Gbps x lines/cm        10,000 lines/cm     2500 Mbps    25 Tbps x lines/cm
inter-MCM           150 lines/cm     500 Mbps     75 Gbps x lines/cm         10,000 lines/cm     2500 Mbps    25 Tbps x lines/cm
inter-board         40 lines/cm      500 Mbps     20 Gbps x lines/cm         1,000 lines/cm^2    2500 Mbps    2.5 Tbps x lines/cm^2


Table 1: Interconnect capability comparison 
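
The density-speed product column of Table 1 is simply the signal density multiplied by the per-line data rate. The following sketch recomputes that column from the other two; the dictionary layout and function name are illustrative choices, and the values are taken directly from the table.

```python
def density_speed_product(density, rate_mbps):
    """Return density x data rate in Gbps (per chip, or Gbps x lines/cm)."""
    return density * rate_mbps / 1000.0


# (signal density, data rate in Mbps) per packaging level, from Table 1.
ELECTRONIC = {
    "switch chip": (64, 1000),
    "inter-chip": (800, 500),
    "inter-MCM": (150, 500),
    "inter-board": (40, 500),
}
OPTICAL = {
    "switch chip": (1024, 1000),
    "inter-chip": (10_000, 2500),
    "inter-MCM": (10_000, 2500),
    "inter-board": (1_000, 2500),
}

for level in ELECTRONIC:
    e = density_speed_product(*ELECTRONIC[level])
    o = density_speed_product(*OPTICAL[level])
    print(f"{level:12s} electronic {e:8.0f} Gbps   optical {o:8.0f} Gbps   ratio {o / e:.1f}x")
```

The ratio column makes the trend in the table explicit: the advantage of the optical approach grows at the higher levels of the packaging hierarchy.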


This paper describes how optically interconnected MCM technology can be used to build
switches that will efficiently meet future networking requirements. This technology is based on
combining advanced packaging techniques (e.g. flip-chip MCMs), high-speed submicron 
VLSI circuits (for switching) and gigabit surface-normal optical interconnects (for 
inter-chip, inter-MCM, and inter-board communication). Our approach is to develop standard 
optoelectronic components and packaging schemes that can be used to build high-performance 
application-specific switches.
The section is organized as follows: section 3.2 briefly reviews the technology used in our
design. Sections 3.3 and 3.4 describe the organization of an ATM switching network. Section
3.5 describes previously proposed electronic ATM switch fabrics. Section 3.6 describes our
design for an ATM switch fabric. Section 3.7 compares our design with electrical MCM,
all-optical and guided-wave approaches. Section 3.8 provides initial results of our effort to build a
prototype system based on the proposed approach. Finally, section 3.9 presents our 
conclusions. 

3.2. OPTICALLY INTERCONNECTED MCMs 

Our design uses optically interconnected multichip module technology being developed at UNC 
Charlotte under DARPA funding. Here we provide only the brief description necessary to
design our ATM switch system. 

Figure 1 shows the proposed packaging approach, whereby multiple MCMs (called translator 
modules) are attached to the holographic PC board 36 . Holographic optical interconnects are 
used for chip-to-chip communication within an MCM as well as for MCM-to-MCM
communication. Note that the same interconnect packaging is used for both levels of the
packaging hierarchy, thus providing similar interconnect density at both these levels. In our
approach we use 2-D arrays of VCSELs directly bonded on top of the switching chips to
increase the I/O capability of the switch chips. 



Figure 1. Multiple optically interconnected MCMs packaged on a holographic PC board. 

An optical connection is made by transmitting an electrical signal from a VLSI chip through a 
flip-chip bonding pad and onto the translator module. The electrical signal activates a laser 
diode on the bottom side of the translator module. The laser generates an optical beam directed 
toward the holographic optical lens on the top side of the translator module. The lens collimates 
the optical beam and directs it to the holographic PC board. The beam is then directed onto an
appropriate detector subhologram on the holographic PC board (after reflection from the planar 
mirror). The beam then passes through a holographic optical lens on a possibly different 
translator module that focuses the light onto the detector on the receiving VLSI chip. For long 
optical connections multiple reflections will be used to maximize the optical interconnection 
density. Multiple reflections can be achieved by placing metallic regions on the bottom side of 
the holographic PC board substrate. 
3.3. NETWORK ARCHITECTURE

The high level architecture of the optoelectronic ATM network is illustrated in figure 2. It 
consists of multiple optoelectronic switch fabrics, input and output controllers (integrated within 
the buffer controller), and the system controller. Multiple switch fabrics are used for cell routing 
in order to improve performance and reliability. The system controller is used for higher level 
functions, such as network management and fabric testing. The function of the buffer 
controllers is to provide an external optical I/O interface, cell buffering and contention 
resolution mechanism. 



Figure 2. ATM switch architecture. 


The optoelectronic switch fabrics and the buffer controllers are optically interconnected and 
packaged using the packaging scheme described in section 3.2. The use of optoelectronic
packaging allows the entire switch to be packaged in a small volume, thus achieving low 
interconnect latencies and high clock rates. Inter-chip and inter-MCM holographic optical 
interconnects are used to provide the required internal wiring density and bandwidth. We note 
that figure 2 does not show the additional hardware modules required for interfacing OC-48 or
OC-192 connections to the buffer controller. The function of these modules is to provide an
ATM framer, VCI and VPI translation, and clock recovery. 

3.4. SWITCH FABRIC ARCHITECTURE 

A number of switch fabric architectures have been previously proposed for ATM switches. Our 
focus here is on the implementation of a certain class of self-routing switch fabrics, herein 
called banyan-based switch fabrics, that use the banyan or a functionally equivalent 
interconnection topology. Examples of topologies that are equivalent to the banyan include 
shuffle-exchange (or perfect shuffle), omega, flip, cube, and baseline 3 . Our decision to use the 
banyan topology is motivated by its inherent simplicity and its scalability to a large number of
ports.

Banyan-based switch fabrics cover a large class of switch fabric architectures, sometimes called 
multistage interconnection networks, that exhibit various performance and cost characteristics. 
The specific architecture to be used for the optoelectronic switch can be chosen to fit 
application-specific functionality and to optimize system performance/cost. Possible choices
include cascaded banyan, tandem banyan, and batcher banyan architectures. Multiple switch 
fabrics can also be considered to improve network performance and reliability. 

The cascaded banyan is a simple architecture requiring $N \log_2 N$ 2x2 switching elements to
construct an N channel network 38. Although it uses a distribution network to randomize the
incoming cell traffic, it is an internally blocking network. Moreover, the likelihood of internal
blocking increases as the switch size is scaled up. An alternative, non-blocking architecture is
the batcher-banyan network 39. The drawback of this architecture is its higher complexity,
because $\frac{N}{2} (\log_2 N + 1) \log_2 N$ 2x2 switching elements are required. If some blocking is
acceptable, one can design a switch fabric that uses less hardware than the batcher-banyan
while achieving nearly identical performance (i.e. the amount of blocking is a design
parameter). This architecture is the tandem banyan network 40.
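
To give a sense of scale, the two element counts quoted above can be evaluated for a concrete network size. The sketch below uses the formulas exactly as given in the text; the function names and the choice of N = 1024 are illustrative.

```python
import math


def cascaded_banyan_elements(n):
    """2x2 switching elements in an n-channel cascaded banyan: n * log2(n)."""
    return n * int(math.log2(n))


def batcher_banyan_elements(n):
    """2x2 switching elements in an n-channel batcher-banyan:
    (n / 2) * (log2(n) + 1) * log2(n)."""
    stages = int(math.log2(n))
    return (n // 2) * (stages + 1) * stages


n = 1024
cascaded = cascaded_banyan_elements(n)
batcher = batcher_banyan_elements(n)
print(n, cascaded, batcher, batcher / cascaded)
```

For N = 1024 the batcher-banyan needs 5.5 times as many elements as the cascaded banyan, which is the hardware cost that the tandem banyan trades against a small, tunable amount of blocking.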

Our design will be built using pipelined and unbuffered KxK crossbar switching elements 41, 42, 43,
thus achieving simple and high-speed switching element design. Increasing the switching
element size (K) reduces the total number of switching elements at the expense of higher
individual switching element complexity. Increasing the channel width (W) allows the switch to
operate at lower internal rates at the expense of higher interconnect density requirements. The
switching element size (K) and channel width (W) for the optoelectronic switch can be varied to optimize
performance/cost for a given application. 
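
The trade-off between K and W can be illustrated with a rough model. The sketch below assumes a single banyan of $\log_K N$ stages with N/K elements per stage, N·W wires between stages, and a per-wire rate equal to the line rate divided by W; these are simplifying assumptions for illustration, not the parameters of the actual design.

```python
def stages(n, k):
    """Number of stages of KxK elements spanning n channels (n must be a power of k)."""
    count, span = 0, 1
    while span < n:
        span *= k
        count += 1
    if span != n:
        raise ValueError("n must be a power of k")
    return count


def fabric_cost(n, k, w, line_rate_mbps):
    """Rough cost figures for one banyan built from KxK elements and W-bit channels."""
    s = stages(n, k)
    return {
        "stages": s,
        "elements": (n // k) * s,               # fewer, but more complex, elements as K grows
        "wires_per_stage": n * w,               # interconnect density grows with W
        "wire_rate_mbps": line_rate_mbps / w,   # internal rate falls with W
    }


for k in (2, 8, 16):
    print(k, fabric_cost(4096, k, w=4, line_rate_mbps=1000))
```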

Congestion in the switch fabric is an important design concern for gigabit switches 44 . 
Congestion can occur within the switch fabric due to internal contention or when several input 
ports send cells to the same output port. Possible solutions to the congestion problem are to buffer
the cells within the switch fabric or to notify the input ports of congestion. The latter approach 
is preferred in gigabit networks and will be used in our design. 

3.5. ELECTRONIC SWITCH FABRICS 

To put our approach in proper perspective, we first review electronic switch fabrics. Most 
electronic implementations of large switch fabrics use multiple switch chips interconnected on 
one or more PCBs or MCMs. Figure 3 illustrates this approach, showing a 64x64 (i.e. 64
channel) perfect shuffle switch fabric constructed from 16 8x8 switch chips. Each switch chip
contains 12 2x2 switching elements. Each 2x2 switching element is an integrated circuit 
containing several hundred logic gates. 
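
The chip count in this example follows from the element counts of the two network sizes. The sketch below assumes that each network has $\log_2 n$ stages of n/2 2x2 elements, which is consistent with the 12 elements per 8x8 chip quoted above; it also shows the perfect shuffle permutation itself as a left rotation of the line address, a standard definition rather than something specific to this design.

```python
import math


def perfect_shuffle(i, n):
    """Destination of line i under the perfect shuffle of n lines (n a power of two):
    a one-bit left rotation of the log2(n)-bit binary address."""
    bits = int(math.log2(n))
    return ((i << 1) | (i >> (bits - 1))) & (n - 1)


def elements(n):
    """2x2 switching elements in an n-channel shuffle network: (n / 2) * log2(n)."""
    return (n // 2) * int(math.log2(n))


per_chip = elements(8)           # elements in one 8x8 switch chip
total = elements(64)             # elements in the 64x64 fabric
chips = total // per_chip        # switch chips required
shuffled = [perfect_shuffle(i, 8) for i in range(8)]
print(per_chip, total, chips, shuffled)
```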
Figure 3. Planar electronic packaging scheme


A major problem with previous implementations of switch chips for banyan-based switch 
fabrics is the limited number of network channels that can be concurrently processed on-chip. 
For example, previous switch chips for batcher-banyan switch fabrics processed 32-64 on-chip 
network channels 45, 46, 47. This limitation occurs for many reasons:

1. The use of perimeter chip-to-substrate contact technology (i.e. wire bonding, TAB) limits 
the contact density from chip to substrate. For example, a 1 cm wide chip with 100 µm
contact pitch has at most 400 contact pads, some of which have to be used for power and
control signals. This leaves less than 400 contact pads for network channels. We note that 
although array chip-to-substrate contact technology (i.e. flip-chip) can be used here to 
increase the chip pin-outs, taking advantage of these additional pins will be difficult in 
light of chip power consumption and substrate connection density constraints. 

2. In high-speed chip designs, the chip power consumption is dominated by the output 
drivers that are required to drive transmission lines. The power required to drive a
transmission line is about 10 mW (i.e. $V^2/R = 1^2/100$ W). Assuming a 5 watt power budget for
the output drivers, at most 500 output channels are possible.

3. Previous chip layouts of the banyan and its equivalents used one-dimensional layouts 48 that
do not scale well in area and wire length when the number of on-chip channels is large.
With 1 µm design rules, chip area and chip speed become wire-limited when the number of
on-chip channels is increased beyond 256. 
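
The pad-count and power limits in items 1 and 2 above reduce to short calculations. The sketch below reproduces them using the figures from the text (1 cm chip, 100 µm pitch, 1 V swing into 100 Ω, 5 W budget); the function names are illustrative.

```python
def perimeter_pads(chip_width_um, pitch_um):
    """Maximum contact pads along the four edges of a square chip."""
    return 4 * (chip_width_um // pitch_um)


def driver_power_mw(swing_v, impedance_ohm):
    """Power to drive a terminated transmission line, V^2 / R, in milliwatts."""
    return 1000.0 * swing_v ** 2 / impedance_ohm


def max_output_channels(budget_w, per_driver_mw):
    """Output drivers that fit within a given power budget."""
    return int(budget_w * 1000 / per_driver_mw)


pads = perimeter_pads(10_000, 100)          # 1 cm = 10,000 um
per_driver = driver_power_mw(1, 100)
channels = max_output_channels(5, per_driver)
print(pads, per_driver, channels)
```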

The limited number of on-chip network channels implies that a large number of switch chips 
will be required to implement a large switch fabric. For example, over 700 chips will be 
required to implement a 1024x1024 batcher-banyan switch fabric using 32x32 switch chips
(i.e. [# stages] x [chips/stage] = $[\log_{32} 1024 \cdot (\log_2 1024 + 1)] \times 1024/32 = 704$). Even with 64x64
switch chips, the total number of chips will still be about 300. This situation creates the
following problems: 

1. A system with a large number of chips has high fabrication cost, low reliability, long 
off-chip wire lengths (and hence reduced clock speed), high power consumption, and large 
system size. If multiple PCBs or MCMs are required to contain all the chips, the situation 
becomes even worse because of the higher cost and lower density that is associated with 
using another level of the packaging hierarchy (i.e. inter-PCB or inter-MCM). 

2. Present schemes for interconnecting switch chips on the PCB or MCM require a large 
number of wires to cross over each other and to extend across the entire board. The 
relatively large interconnect pitch of PCB and MCM (typically 25 µm for MCM-D and
250 µm for PCB vs. 2 µm on-chip) and the limited number of wiring layers (typically 2 for
MCM-D and 8 for PCB) can create wiring congestion when interconnecting a large
number of switch chips. 
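
The chip-count estimate that precedes this list can be reproduced with the same formula, [# stages] x [chips/stage]. The sketch below is illustrative; it lets the number of chip-level stages be fractional, which is an assumption that appears to match the "about 300" figure for 64x64 chips.

```python
import math


def batcher_banyan_chips(n, chip_channels):
    """Estimated switch chips for an n-channel batcher-banyan fabric:
    [log_c(n) * (log2(n) + 1)] stages times n / c chips per stage."""
    chip_stages = math.log2(n) / math.log2(chip_channels)  # log_c(n), may be fractional
    total_stages = chip_stages * (math.log2(n) + 1)
    return total_stages * n / chip_channels


chips_32 = batcher_banyan_chips(1024, 32)
chips_64 = batcher_banyan_chips(1024, 64)
print(chips_32, round(chips_64))
```

If the number of chip-level stages for 64x64 chips is rounded up to a whole number, the same formula gives 352 chips instead.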
Figure 4. 3-D electronic packaging scheme

A three-dimensional packaging scheme for implementing banyan-based switch fabrics has 
been previously proposed to overcome the problems of large system size, wiring congestion and 
long off-chip wire lengths (see figure 4) 49 . This packaging scheme works by arranging boards 
with multiple switch chips into columns and interconnecting them with orthogonal planes. The 
main disadvantage of the 3-D electronic packaging scheme is the large number of switch chips 
required. This leads to high fabrication costs, high power consumption, and reduced
reliability. The next section describes the proposed approach based on 3-D optoelectronic
packaging. In this scheme, the number of switch chips required is dramatically reduced relative to the
3-D electronic packaging approach, because more network channels are processed on-chip.
3.6. OPTOELECTRONIC SWITCH FABRIC



Figure 5. 2-D optoelectronic chip layout scheme 


The previous section has identified the limited number of network channels in previous switch 
chip designs as the main impediment to efficient implementation of large banyan-based switch 
fabrics. In order to increase the number of on-chip network channels we propose to use a 
two-dimensional chip layout scheme, herein called 2-D banyan layout. As shown in figure 5 
this layout scheme is functionally equivalent to previous 1-D banyan layouts 50 . However, by 
uniformly distributing the I/O ports throughout the chip area, the 2-D layout scheme can 
accommodate larger and faster switches within a given area. The use of optical interconnects
provides the additional pin-out needed to handle the increased I/O requirements. These last two points are
illustrated graphically in figure 6. We note that this section assumes bit-serial channels (W=1)
and 2x2 switching elements (K=2) to simplify the discussion. 


[Figure 6 panels: "Power Dissipation vs. Data Rate" (electrical pin-out: 1000 IOs = 50-250 Watts; optoelectronic pin-out: 1000 IOs = 4-20 Watts) and "2-D vs. 1-D Chip Layout Comparison" (horizontal axis: # Channels on Chip). Layout assumptions: 1.2 µm CMOS double-level metal process; minimum feature size (λ) 0.6 µm; 100 transistors per switching element; area per transistor (A) 555 (20 µm x 10 µm); wire pitch (W) 4 (2.4 µm).]


Figure 6. Comparison of 2-D optoelectronic and 1-D electronic switch chip layouts 


A large switch fabric requires multiple banyan networks to be interconnected in series and/or in 
parallel. We can achieve this by optically interconnecting multiple 2-D layout switch chips 
packaged on single or multiple MCMs as shown in figure 7. In this approach, several banyan 
networks are packaged on a single holographic PC board. The 2-D layout scheme can be 
repeated at higher levels of the packaging hierarchy to build larger networks. Alternatively, 
series interconnection of switch chips or MCMs can be used to achieve the many stages
required for tandem-banyan and batcher-banyan networks. Inter-chip communication within 
the MCM can be optical or electrical depending on interconnect speed and density 
consideration. Our approach will be to use electrical interconnects for short lines that can be 
treated as lumped capacitors. Multiple PC boards are stacked in parallel to construct the 
complete switch fabric. Inter-board communication is done with surface normal optical 
interconnects. The specific partitioning of the switch architecture onto our packaging
scheme is determined by the application and the specific performance/cost requirements. 



Figure 7. Optoelectronic switch fabric 

Our packaging scheme provides the following advantages: 

1. The 2-D layout is ideally suited to array contact pad capability of MCMs and 2-D optical 
I/O because it uniformly distributes the input and output ports throughout the chip area. 
With array contacts, the I/O pad density limit of present designs is removed. For example, 
a 1 cm wide chip with 100 µm contact pitch has at most 10,000 contact pads vs. 400 for
the perimeter contact case. 
2. The use of 2-D layout allows larger networks to be concurrently processed on-chip.
Moreover, these networks can operate at higher speed because of shorter on-chip wires.
For example, for a 1024 channel switch chip in 1 µm design rules, the 2-D layout chip is a
square (1.25cm x 1.25cm) with 0.63 cm longest wire length. On the other hand, the 1-D 
layout chip is a rectangle with impossible-to-fabricate dimensions (5.1cm x 0.45cm) with 
2.70 cm longest wire length. 

3. The number of chips required to implement a large network is dramatically reduced. For 
example, a 1024 channel batcher-banyan network requires only 11 1024 channel switch 
chips vs. 300 chips required with previous designs using 64 channel switch chips. In this 
case, the entire network can be easily achieved on a single holographic PC board. This 
reduces the system size, power consumption and cost. 

4. The use of 3-D packaging allows a dramatic reduction in system size and interconnect
latencies involved in packaging the many chips required for a large switch.

5. Electrical interconnects are used for on-chip wiring and for short inter-chip connections on 
the MCM. The use of electrical interconnects at these packaging levels is advantageous 
because high interconnect density and low power consumption can be achieved with 
electrical interconnect for on-chip wiring as well as for short inter-chip connections (e.g. 
short inter-chip connections do not behave as transmission lines and thus do not have the 
power consumption and fabrication complexity). 

6. Optical interconnects are used for long chip-to-chip connections on the MCM, inter-MCM 
connections, and inter-board connections. The use of optical interconnect at these 
packaging levels allows higher connection density and lower power consumption than that 
possible with electrical interconnects as described in section 3. 

7. The proposed scheme has excellent scalability potential. For example, with 1 µm design
rules, we can implement a 4096 channel cascaded banyan network using 48 1024 channel
switch chips on a single holographic PC board. Multiple 4096 channel cascaded banyan
networks can be achieved by stacking holographic PC boards with the 3-D packaging
scheme. 
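
The numerical claims in items 1 and 2 of this list can be checked with a few lines of arithmetic. The sketch below compares array and perimeter pad counts for a 1 cm chip at 100 µm pitch, and the chip area and longest wire of the 2-D and 1-D layouts, using only the dimensions quoted above; the function names are illustrative.

```python
def perimeter_pads(chip_width_um, pitch_um):
    """Contact pads along the four edges of a square chip."""
    return 4 * (chip_width_um // pitch_um)


def array_pads(chip_width_um, pitch_um):
    """Contact pads when the whole chip area is covered by a square pad grid."""
    return (chip_width_um // pitch_um) ** 2


pads_array = array_pads(10_000, 100)
pads_perimeter = perimeter_pads(10_000, 100)

# Dimensions quoted for a 1024 channel switch chip (cm).
area_2d = 1.25 * 1.25
area_1d = 5.1 * 0.45
wire_ratio = 2.70 / 0.63   # longest wire, 1-D layout vs. 2-D layout

print(pads_array, pads_perimeter, area_2d, area_1d, round(wire_ratio, 1))
```

Under these figures the array arrangement offers 25 times as many pads, and the longest wire of the 1-D layout is roughly 4.3 times that of the 2-D layout.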

3.7. COMPARISON WITH OTHER APPROACHES 

3.7.1. Electronic MCMs 

To determine the usefulness of our approach we have compared it with an equivalent electronic 
implementation using state-of-the-art flip-chip electronic MCMs. Table 2 shows the 
assumptions made for electronic MCMs. Our electronic design uses 2-D layout switch chips 
interconnected using 2-D layout on the MCM (i.e. replacing the optical interconnects of section 3.6
with electrical wires). Figure 8 shows the cost of implementing a 4096 channel banyan switch 
using electronics. 
TECHNOLOGY:                                  SYSTEM DESIGN:
THIN FILM HYBRID MCM TECHNOLOGY              SHUFFLE NETWORK WITH 4096 CHANNELS, 12 STAGES
CMOS 1.2 µm CHIP TECHNOLOGY                  REQUIRES TOTAL OF 64 CHIPS
ON-CHIP WIRING PITCH = 4 microns             EACH CHIP IS A 64 NODE HYPERCUBE
OFF-CHIP WIRING PITCH = 50 microns           768 I/O PINS PER CHIP
2 LAYERS OF SIGNAL INTERCONNECT              1.2K GATES PER NODE
CMOS OFF-CHIP DRIVERS HAVE 3V SWING          77K LOGIC GATES PER CHIP
FLIP-CHIP MOUNTING WITH 100 µm PAD PITCH     5M LOGIC GATES IN SYSTEM
                                             25K SIGNAL WIRES ON MCM
                                             MCM BISECTION WIDTH = 4096 WIRES


Table 2. Electronic MCM assumptions 
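
Several of the system design entries in Table 2 follow from one another, which gives a quick consistency check. The sketch below derives the stage count and gate totals from the table values; the signal wire estimate assumes that every I/O pin is used and that each MCM wire joins exactly two pins, which is an inference rather than a statement from the source.

```python
import math

CHANNELS = 4096
CHIPS = 64
NODES_PER_CHIP = 64
GATES_PER_NODE = 1200
PINS_PER_CHIP = 768

stages = int(math.log2(CHANNELS))             # shuffle stages for 4096 channels
gates_per_chip = NODES_PER_CHIP * GATES_PER_NODE
system_gates = CHIPS * gates_per_chip
mcm_wires = CHIPS * PINS_PER_CHIP // 2        # assumption: two pins per signal wire

print(stages, gates_per_chip, system_gates, mcm_wires)
```

The results (12 stages, 76,800 gates per chip, about 4.9 million gates in the system, and 24,576 wires) agree with the rounded table entries of 12 stages, 77K, 5M and 25K.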


Our results show that the electronic system power budget is dominated by the power required 
to drive the chip-to-chip electrical wires which behave as transmission lines. This system power 
bottleneck is greatly eased by the use of optical interconnects which as shown in figure 6 
consume much lower power than their electrical counterparts. Likewise, the system size is 
dominated by the MCM substrate wiring (i.e. chips have to be spaced widely apart to
accommodate the necessary chip-to-chip wiring). In this case, the higher density of optical 
interconnects allows a more compact package to be implemented than possible with electronics. 
Finally, the system clock budget is dominated by the driver latency which can be expected to 
reduce with lower power optical interconnects. Our comparison is done at the MCM level, but 
as shown in table 1, we expect that the benefits of using optical interconnects will be even 
greater for higher levels of the packaging hierarchy. 
[Figure 8 panels: E-MCM system area budget; E-MCM system power budget; E-MCM system clock budget.]


Figure 8. Comparison summary 

3.8. CONCLUSION 

This section has presented a generic hardware module for building gigabit ATM switches. The
design is based on optically interconnected MCM technology that provides the dense and
high-speed I/O capability at multiple levels of the packaging hierarchy to meet the demands of the
ATM switching application. We have compared our approach with electronic MCM, waveguide
and all-optical switches, showing that for a large number of channels operating at gigabit channel
data rates, our approach becomes attractive from performance and cost considerations. 